Data-Driven Insights of Airbnb Market in Seattle

In this project, I explore key insights into the Seattle Airbnb market through interactive data visualization and text mining.

Business Understanding

The Seattle Airbnb dataset contains files describing Airbnb listings in Seattle, calendar availability for each listing, user reviews of the listings, and the geometry of each neighbourhood. Using this dataset, I attempt to answer three business questions: where the listings are located and what they cost by neighbourhood, how availability and prices develop in the near future, and what guests like or complain about in their reviews.

Data Understanding

The Seattle Airbnb dataset used here was obtained from insideairbnb.com and was compiled on October 25, 2020. Let's first get an overview of the dataset.

LISTINGS data

Overall, there were 4335 listings in Seattle on October 25, 2020. The columns I use below include id, neighbourhood_cleansed, latitude, longitude, accommodates and price. None of these columns has null values.

CALENDAR data

The calendar file has 365 records for each listing, i.e., the price and availability of each listing are specified by date for the next 365 days.
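The 365-records-per-listing claim can be checked with a short count. The following is a minimal sketch using only the standard library; it assumes each calendar row is a dict with a `listing_id` key, and the helper name `listings_without_full_year` and the sample rows are hypothetical.

```python
from collections import Counter

def listings_without_full_year(calendar_rows, expected_days=365):
    """Return {listing_id: record_count} for listings that deviate from expected_days."""
    counts = Counter(row["listing_id"] for row in calendar_rows)
    return {lid: n for lid, n in counts.items() if n != expected_days}

# Made-up rows for illustration: listing 1 has a full year, listing 2 is one day short.
rows = [{"listing_id": 1, "day": d} for d in range(365)]
rows += [{"listing_id": 2, "day": d} for d in range(364)]

incomplete = listings_without_full_year(rows)
print(incomplete)  # {2: 364}
```

An empty result would confirm that every listing has a complete calendar.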

REVIEWS data

The reviews file holds the reviews for each listing. Some comments are missing, so we use only the available comments for the sentiment analysis.

NEIGHBOURHOODS data

The neighbourhoods file contains the geometry of 91 neighbourhoods in Seattle. As with the calendar file, the information in the neighbourhoods file is complete.

Data Preparation and Exploration

Data preparation is straightforward here because the information we need is either complete or does not require imputing missing values. The only data wrangling we need to perform is converting the price columns in both the listings and calendar files to the float type.
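The price conversion can be sketched as follows. This is a minimal example that assumes prices are stored as strings such as "$1,250.00" (a currency symbol plus thousands separators); the helper name `parse_price` and the sample values are hypothetical.

```python
def parse_price(raw):
    """Convert a price string such as '$1,250.00' to a float; empty input gives None."""
    if raw is None:
        return None
    # Assumption: the only non-numeric characters are '$', ',' and whitespace.
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return float(cleaned) if cleaned else None

prices = ["$85.00", "$1,250.00", " $99.50 "]
parsed = [parse_price(p) for p in prices]
print(parsed)  # [85.0, 1250.0, 99.5]
```

The same function can be applied to the price column of both the listings and the calendar files.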

Before we move on to data modeling, we conduct a brief exploratory data analysis to get a sense of the primary locations, room types and capacity of the listings.

Locations on the map

As expected, most listings are located in the central area of the city. This map is interactive, and we can zoom in on the clusters to eventually find the locations of individual listings.

Room types

In Seattle, the majority of Airbnb listings are entire homes/apartments. Listings for private rooms and hotel rooms are very rare.

Accommodates

Most listings accommodate 2 people.

Data Modeling and Results

Q1: Where are the listings located and what are the average prices of these listings by neighbourhood?

Number of listings by neighbourhood

The spatial distribution shows that listings are concentrated in two areas. One is the Belltown-Central Business District-Broadway group of neighbourhoods, which represents the downtown area. The other is the Wallingford-University District area, which includes the campus of the University of Washington (UW). Both the downtown area and the UW campus are attractive destinations for tourists.

Average daily price by neighbourhood

To compare the average daily price by neighbourhood, we select only the neighbourhoods that have at least 5 listings of the most common accommodation type, which is accommodation for 2 people.
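The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the original analysis code: rows are dicts keyed by the column names neighbourhood_cleansed, accommodates and price (already converted to float), and the function name and sample rows are made up.

```python
from collections import defaultdict
from statistics import mean

def average_price_by_neighbourhood(listings, accommodates=2, min_listings=5):
    """Mean price per neighbourhood for listings of one capacity.

    Neighbourhoods with fewer than min_listings matching listings are dropped.
    """
    prices = defaultdict(list)
    for row in listings:
        if row["accommodates"] == accommodates:
            prices[row["neighbourhood_cleansed"]].append(row["price"])
    averages = {
        name: mean(values)
        for name, values in prices.items()
        if len(values) >= min_listings
    }
    # Most expensive neighbourhoods first.
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))

# Made-up rows for illustration only.
sample = (
    [{"neighbourhood_cleansed": "Belltown", "accommodates": 2, "price": p}
     for p in (100.0, 120.0, 140.0, 110.0, 130.0)]
    + [{"neighbourhood_cleansed": "Fremont", "accommodates": 2, "price": 90.0}]
    + [{"neighbourhood_cleansed": "Belltown", "accommodates": 4, "price": 300.0}]
)
result = average_price_by_neighbourhood(sample)
print(result)  # {'Belltown': 120.0}
```

In the sample, Fremont is dropped because it has only one 2-person listing, and the 4-person Belltown listing does not affect the average.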

The costliest neighbourhoods are also in the downtown area, likely due to high demand, while average prices around the UW campus are far lower. Other relatively expensive places, such as West Woodland and North Beach/Blue Ridge, are waterfront neighbourhoods.

Q2: What is the availability of the accommodations and what is the price trend in the near future?

Availability over time

There are generally more accommodations available up to three months ahead than further into next year. Part of the reason might be that hosts update their calendars more actively within this timeframe. In addition, because of Seattle's rainy winter, most people prefer to visit Seattle in summer or autumn.
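An availability curve like this one can be derived by computing, for each date, the share of listings marked as available. The sketch below assumes calendar rows are dicts with an ISO-formatted `date` and an `available` flag encoded as "t" or "f"; that encoding, the function name and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def availability_by_date(calendar_rows):
    """Return {date: share of listings marked available on that date}."""
    available = defaultdict(int)
    total = defaultdict(int)
    for row in calendar_rows:
        total[row["date"]] += 1
        if row["available"] == "t":  # assumed flag encoding
            available[row["date"]] += 1
    # ISO date strings sort chronologically.
    return {day: available[day] / total[day] for day in sorted(total)}

# Made-up rows for illustration only.
rows = [
    {"listing_id": 1, "date": "2020-10-25", "available": "t"},
    {"listing_id": 2, "date": "2020-10-25", "available": "f"},
    {"listing_id": 1, "date": "2020-10-26", "available": "t"},
    {"listing_id": 2, "date": "2020-10-26", "available": "t"},
]
shares = availability_by_date(rows)
print(shares)  # {'2020-10-25': 0.5, '2020-10-26': 1.0}
```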

Average price by date

The average daily price for a 2-person place peaks on September 4 next year at about $132, and the cyclical pattern is due to higher prices on weekends.
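One way to check the weekend effect is to split the daily averages into weekend and weekday nights and compare their means. The sketch below treats Friday and Saturday as weekend nights, which is an assumption rather than something stated in the analysis; the function name and the prices are made up, although the dates fall around the September 4 peak.

```python
from datetime import date
from statistics import mean

# date.weekday(): Monday is 0, so 4 and 5 are Friday and Saturday (assumed weekend nights).
WEEKEND_NIGHTS = {4, 5}

def weekend_vs_weekday(prices_by_date):
    """Mean price on weekend nights vs. all other nights, from {iso_date: price}."""
    weekend, weekday = [], []
    for iso_day, price in prices_by_date.items():
        is_weekend = date.fromisoformat(iso_day).weekday() in WEEKEND_NIGHTS
        (weekend if is_weekend else weekday).append(price)
    return {"weekend": mean(weekend), "weekday": mean(weekday)}

# Made-up average prices for illustration only (Thursday through Sunday).
daily = {
    "2021-09-02": 120.0,
    "2021-09-03": 130.0,
    "2021-09-04": 132.0,
    "2021-09-05": 122.0,
}
summary = weekend_vs_weekday(daily)
print(summary)  # {'weekend': 131.0, 'weekday': 121.0}
```

Note that `mean` raises an error on an empty list, so the input needs at least one date in each group.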

Q3: What do tourists like about their accommodations and what do they usually complain about if they had a bad experience?

Last, we want to find out which housing properties (e.g., proximity to restaurants and shops, hygiene, safety, etc.) lead to a good rental experience, and explore some of the worst reviews. Here, we use the Python package VADER, a lexicon- and rule-based sentiment analysis tool, to compute the polarity score of the comments.

The advantages of VADER over traditional methods of sentiment analysis include:

Reference: https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f
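The scoring step could look roughly like the sketch below. It assumes the `vaderSentiment` distribution of VADER and its `SentimentIntensityAnalyzer.polarity_scores` method, which returns a dict containing a `compound` score between -1 and 1. The cut-offs of ±0.05 for labelling a comment are the commonly used convention, not a value taken from this analysis, and both helper names are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def score_comments(comments):
    """Compound polarity score per comment; missing or empty comments are skipped."""
    # Imported lazily so the labelling helper works without the package installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [
        analyzer.polarity_scores(text)["compound"]
        for text in comments
        if isinstance(text, str) and text.strip()
    ]

labels = [label_from_compound(score) for score in (0.92, 0.0, -0.6)]
print(labels)  # ['positive', 'neutral', 'negative']
```

Calling `score_comments` on the review comments and then labelling each score gives the share of positive, neutral and negative reviews.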

We can see that an overwhelming proportion of reviews are positive. Now let's take a closer look at a few examples of the best and worst reviews.

We find:

Here, instead of counting the frequency of words in the corpus, we use the term frequency-inverse document frequency (TF-IDF) to rank words. TF-IDF penalizes overused terms, which helps reduce non-informative words.
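To illustrate how TF-IDF penalizes overused terms, here is a minimal standard-library sketch of the textbook formula tf * log(N / df). It is not the library class used in this analysis; library implementations typically add smoothing and normalization, so their numbers will differ. The function name and the toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each term of each tokenized document with tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Toy reviews, already tokenized, for illustration only.
docs = [
    ["great", "place", "quiet", "street"],
    ["great", "place", "dirty", "shower"],
    ["great", "view", "quiet", "park"],
]
scores = tf_idf(docs)
print(scores[1])
```

In this toy corpus "great" appears in every document, so its score is zero, while "dirty" appears in only one document and receives the highest weight there.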

Note that we pass some additional parameters to the TF-IDF class:

Reference: https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification

Next, let’s compare the features of accommodations that received positive vs. negative reviews via word clouds.

It’s not surprising the top positive terms include: walk, restaurants, food, shop, bus, park, view, safe and quiet.

In contrast, the top negative terms include: dirty, broken, noise, smell, old and small. Guests also seem to have frequent issues with amenities, such as shower, door, towel, window, toilet and sheet.